MODEL ADAPTIVE APPARATUS AND MODEL ADAPTIVE METHOD, RECORDING MEDIUM,
AND PATTERN RECOGNITION APPARATUS

BACKGROUND OF THE INVENTION
1. Field of the Invention

The present invention relates to a model adaptive apparatus and a
model adaptive method, a recording medium, and a pattern recognition
apparatus. More particularly, the present invention relates to a model
adaptive apparatus and a model adaptive method, a recording medium, and
a pattern recognition apparatus, which are suitable for use in a case in
which, for example, speech recognition is performed.
2. Description of the Related Art

There have hitherto been known methods of recognizing words
which are spoken in a noisy environment. Typical methods thereof are a
PMC (Parallel Model Combination) method, an SS/NSS (Spectral
Subtraction/Nonlinear Spectral Subtraction) method, an SFE (Stochastic
Feature Extraction) method, etc.

The PMC method has satisfactory recognition performance because
information on environmental noise is taken directly into a sound model,
but calculation costs are high (since high-level computations are necessary,
the apparatus is large, processing takes a long time, etc.). In the SS/NSS
method, at a stage in which features of speech data are extracted,
environmental noise is removed. Therefore, the SS/NSS method has a
lower calculation cost than that of the PMC method and is widely used at
the present time. In the SFE method, in a manner similar to the SS/NSS
method, at a stage in which features of a speech signal containing
environmental noise are extracted, the environmental noise is removed, and
as features, those represented by a probability distribution are extracted.
The SFE method, as described above, differs from the SS/NSS method and
the PMC method, in which the features of speech are extracted as a point in
the feature space, in that the features of speech are extracted as a
distribution in the feature space.

In each of the above-described methods, after the extraction of the
features of speech, it is determined which one of the sound models
corresponding to plural words, which are registered in advance, the features
match best, and the word corresponding to the sound model which matches
best is output as a recognition result.

The details of the SFE method are described in Japanese
Unexamined Patent Application Publication No. 11-133992 (Japanese
Patent Application No. 9-300979), etc., which was previously submitted by
the applicant of this application. Furthermore, the details of the
performance comparisons, etc., among the PMC method, the SS/NSS
method, and the SFE method are described in, for example, "H. Pao, H.
Honda, K. Minamino, M. Omote, H. Ogawa and N. Iwahashi, Stochastic
Feature Extraction for Improving Noise Robustness in Speech Recognition,
Proceedings of the 8th Sony Research Forum, SRF98-234, pp. 9-14,
October 1998"; "N. Iwahashi, H. Pao, H. Honda, K. Minamino, and M.
Omote, Stochastic Features for Noise Robust in Speech Recognition,
ICASSP'98 Proceedings, pp. 633-636, May 1998"; "N. Iwahashi, H. Pao
(presenter), H. Honda, K. Minamino and M. Omote, Noise Robust Speech
Recognition Using Stochastic Representation of Features, ASJ'98-Spring
Proceedings, pp. 91-92, March 1998"; "N. Iwahashi, H. Pao, H. Honda, K.
Minamino and M. Omote, Stochastic Representation of Features for Noise
Robust Speech Recognition, Technical Report of IEICE, pp. 19-24, SP97-97
(1998-01)"; etc.

In the above-described SFE method, etc., environmental noise is not
taken into account directly at the stage of speech recognition, that is,
information on environmental noise is not input directly into a no-speech
sound model, causing a problem of inferior recognition performance to
occur.

Furthermore, due to the fact that information on environmental noise
is not taken directly into a no-speech sound model, there is another problem
in that recognition performance is decreased as the time from the start of the
speech recognition until the start of speech production is increased.
SUMMARY OF THE INVENTION 

The present invention has been achieved in view of such
circumstances. An object of the present invention is to correct a no-speech
sound model by using environmental noise information, thereby preventing
the decrease in recognition performance that occurs as the time from the
start of speech recognition until the start of speech production is increased.

To achieve the above-mentioned object, in a first aspect, the present
invention provides a model adaptive apparatus comprising model
adaptation means for performing an adaptation of a predetermined model
used in pattern recognition on the basis of extracted data in a predetermined
interval and the degree of freshness representing the recentness of the
extracted data.

The pattern recognition may be performed based on a feature
distribution in a feature space of input data.

The model adaptation means may perform an adaptation of the
predetermined model by using, as a degree of freshness, a function in
which the value changes in such a manner as to correspond to the
time-related position of the extracted data in the predetermined interval.

The function may be a monotonically increasing function which
increases as time elapses.

The function may be a linear or nonlinear function.

The function may take discrete values or continuous values.

The function may be a second-order function, a third-order function,
or a higher-order function.

The function may be a logarithmic function.

The input data may be speech data. 

The predetermined model may be a sound model representing noise 
in an interval which is not a speech segment. 

Data extraction means may optionally comprise: 

• framing means having an input for receiving a source of speech and/or 
environmental noise and for producing in response data frames; 

• noise observation interval extraction means for extracting a noise vector 
for a number (m) of frames in a noise observation interval; 

• feature extraction means responsive to the noise vector (a) and to an 
observation vector in a speech recognition interval to produce a feature 
vector (y); and 

• no-speech sound model correction means responsive to the noise vector. 

In an embodiment, the apparatus may optionally also comprise: 

• power spectrum analysis means for receiving the extracted data; 



• noise characteristic calculation means responsive to environmental 
noise; and 

• feature distribution parameter calculation means for producing a feature
distribution parameter in response to the power spectrum analysis means
and the noise characteristic calculation means.

The apparatus of the above embodiment may optionally further 
comprise: 

• a plurality of identification function computation means of which one at
least receives a no-speech model, the means receiving the feature
distribution parameter and producing in response a respective
identification function; and

• determination means responsive to the identification functions to 
produce a recognition result on the basis of a closest match. 

The apparatus may optionally comprise:

• feature extraction means for extracting the features of the input data;

• storage means for storing a predetermined number of models into which
the input data is to be classified; and

• classification means for classifying the features of the input data,
corresponding to a predetermined model, which is observed in a
predetermined interval, and for outputting the data as extracted data.

In a second aspect, the present invention provides a model adaptive
method comprising a model adaptation step of performing an adaptation of
a predetermined model on the basis of the extracted data in a predetermined
interval and the degree of freshness representing the recentness of the
extracted data.

In a third aspect, the present invention provides a recording medium
having recorded therein a program comprising a model adaptation step of
performing an adaptation of a predetermined model on the basis of
extracted data in a predetermined interval and the degree of freshness
representing the recentness of the extracted data.

In a fourth aspect, the present invention provides a pattern
recognition apparatus comprising model adaptation means for performing
an adaptation of a predetermined model on the basis of extracted data in a
predetermined interval and the degree of freshness representing the
recentness of the extracted data.



In the model adaptive apparatus and the model adaptive method, the 
recording medium, and the pattern recognition apparatus of the present 
invention, an adaptation of a predetermined model is performed based on 
extracted data in a predetermined interval and the degree of freshness 
representing the recentness of the extracted data. 

The above and further objects, aspects and novel features of the
invention will become more fully apparent from the following detailed
description when read in conjunction with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 is a block diagram showing an example of the construction of 
a speech recognition apparatus according to the present invention. 

Fig. 2 is a diagram illustrating the operation of a noise observation
interval extraction section 3 of Fig. 1.

Fig. 3 is a block diagram showing a detailed example of the
construction of a feature extraction section 5 of Fig. 1.

Fig. 4 is a block diagram showing a detailed example of the
construction of a speech recognition section 6 of Fig. 1.

Fig. 5 is a diagram showing an HMM (Hidden Markov Model). 

Fig. 6 is a diagram showing simulation results. 

Fig. 7 is a diagram showing a normal distribution of a no-speech 
sound model. 

Fig. 8 is a block diagram showing an example of the construction of 
a no-speech sound model correction section 7 of Fig. 1. 

Fig. 9 is a diagram showing a state in which a discrete value is 
converted into a continuous value. 

Fig. 10 is a diagram showing a general freshness function F(x).

Fig. 11 is a diagram showing a first example of the freshness
function F(x).

Fig. 12 is a diagram showing a second example of the freshness
function F(x).

Fig. 13 is a diagram showing a third example of the freshness
function F(x).

Fig. 14 is a diagram showing a fourth example of the freshness
function F(x).






Fig. 15 is a diagram showing a fifth example of the freshness 
function F(x). 

Fig. 16 is a diagram showing a sixth example of the freshness 
function F(x). 

Fig. 17 is a block diagram showing an example of the construction of
an embodiment of a computer according to the present invention.
DESCRIPTION OF THE PREFERRED EMBODIMENT 

Fig. 1 shows an example of the construction of an embodiment of a
speech recognition apparatus according to the present invention. In this
speech recognition apparatus, a microphone 1 collects produced speech
which is the object for recognition, together with environmental noise, and
outputs it to a framing section 2. The framing section 2 extracts speech data
input from the microphone 1 at a predetermined time interval (for example,
10 milliseconds), and outputs the extracted data as data of one frame. The
speech data in units of one frame, which is output by the framing section 2,
is supplied, as an observation vector "a" in which each of the speech data in
a time series which form that frame is a component, to a noise observation
interval extraction section 3 and to a feature extraction section 5.
Hereinafter, where appropriate, an observation vector which is speech data
of a t-th frame is denoted as a(t).
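The framing operation can be pictured with a minimal sketch. The function name frame_signal, the sample rate, the use of non-overlapping frames, and the dropping of a trailing partial frame are all assumptions made for illustration; the text specifies only the 10-millisecond example interval.

```python
def frame_signal(samples, sample_rate_hz, frame_ms=10):
    """Split a sample sequence into consecutive, non-overlapping frames.

    Each returned frame plays the role of one observation vector a(t).
    Non-overlapping frames and the dropped remainder are assumptions.
    """
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    if frame_len == 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len
    return [list(samples[t * frame_len:(t + 1) * frame_len])
            for t in range(n_frames)]


# 50 samples at an assumed 1 kHz rate give five 10-sample frames.
frames = frame_signal(list(range(50)), sample_rate_hz=1000, frame_ms=10)
```

Here frames[t] corresponds to the observation vector a(t) of the t-th frame, counted from zero in this sketch.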

The noise observation interval extraction section 3 buffers the speech
data in frame units, which is input from the framing section 2, by an amount
of a predetermined time (by an amount of M or more frames), extracts an
observation vector "a" for M frames in a noise observation interval Tn
which is from a timing tb at which a speech production switch 4 is turned on
to a timing ta which is previous by an amount of M frames, and outputs it to
the feature extraction section 5 and a no-speech sound model correction
section 7.
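The buffering described above can be sketched as a fixed-length queue that always holds the most recent M frames. The class and method names below are hypothetical, and the sketch assumes the buffer is read at the moment the speech production switch is turned on.

```python
from collections import deque


class NoiseObservationBuffer:
    """Keeps the latest M frames seen before the switch is turned on."""

    def __init__(self, m_frames):
        # deque with maxlen silently discards the oldest frame when full.
        self._frames = deque(maxlen=m_frames)

    def push(self, frame):
        """Store one frame delivered by the framing stage."""
        self._frames.append(frame)

    def on_switch_on(self):
        """Return the buffered frames, oldest first (interval ta to tb)."""
        return list(self._frames)


buf = NoiseObservationBuffer(m_frames=3)
for t in range(5):
    buf.push([t])            # one-sample frames, purely for illustration
noise_frames = buf.on_switch_on()
```

A fixed-length deque is one convenient choice here because it needs no explicit bookkeeping of the timing ta.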

The speech production switch 4 is turned on by a user when the user
starts to produce speech and is turned off when the speech production is
terminated. Therefore, the produced speech is not contained in the speech
data before timing tb (noise observation interval Tn) at which the speech
production switch 4 is turned on, and only environmental noise is present.
Furthermore, the interval from the timing tb at which the speech production
switch 4 is turned on to a timing td at which the speech production switch 4
is turned off is a speech recognition interval, and the speech data in that
speech recognition interval is an object for speech recognition.

The feature extraction section 5 removes the environmental noise
components from the observation vector "a" in the speech recognition
interval after timing tb, which is input from the framing section 2, on the
basis of the speech data in which only the environmental noise in the noise
observation interval Tn, which is input from the noise observation interval
extraction section 3, is present, and extracts the features. That is, the
feature extraction section 5 performs a Fourier transform on, for example,
the true speech data (from which the environmental noise has been
removed) as the observation vector "a" in order to determine the power
spectrum thereof, and calculates a feature vector y in which each frequency
component of the power spectrum is a component. The method of
calculating the power spectrum is not limited to a method using a Fourier
transform. That is, in addition, the power spectrum can be determined, for
example, by what is commonly called a filter bank method.

In addition, the feature extraction section 5 calculates a parameter
(hereinafter referred to as a "feature distribution parameter") Z representing
the distribution in a feature vector space, which is obtained when speech
contained in speech data as an observation vector "a" is mapped into a
space (the feature vector space) of the features, on the basis of the
calculated feature vector y, and supplies it to the speech recognition section
6.

Fig. 3 shows a detailed example of the construction of the feature
extraction section 5 of Fig. 1. In the feature extraction section 5, the
observation vector "a" input from the framing section 2 is supplied to a
power spectrum analysis section 11. In the power spectrum analysis section
11, the observation vector "a" is subjected to a Fourier transform by, for
example, an FFT (Fast Fourier Transform) algorithm, whereby the power
spectrum of the speech is extracted as a feature vector. Herein, it is
assumed that the observation vector "a" as speech data of one frame is
converted into a feature vector (D-dimensional feature vector) formed of D
components.
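As an illustration of the power spectrum analysis, the sketch below computes a power spectrum with a direct discrete Fourier transform rather than an FFT, which gives the same values at higher cost. The function name, the use of the squared magnitude, and the choice D = N/2 + 1 (the non-redundant bins of a real N-sample frame) are assumptions; the text does not fix them.

```python
import cmath


def power_spectrum(frame):
    """Return the power spectrum of one real-valued frame.

    Direct DFT, O(N^2); an FFT would produce the same numbers faster.
    Only bins 0 .. N/2 are kept, so D = N // 2 + 1 (an assumption).
    """
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = 0j
        for idx, sample in enumerate(frame):
            acc += sample * cmath.exp(-2j * cmath.pi * k * idx / n)
        spectrum.append(abs(acc) ** 2)
    return spectrum


# A cosine that completes two cycles over eight samples.
y = power_spectrum([1, 0, -1, 0, 1, 0, -1, 0])
```

For this test frame, all of the power lands in bin 2, and the resulting list y is a D-dimensional feature vector with D = 5.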

Here, a feature vector obtained from an observation vector a(t) of the
t-th frame is denoted as y(t). Furthermore, of the feature vector y(t), the
spectrum component of the true speech is denoted as x(t), and the spectrum
component of environmental noise is denoted as u(t). In this case, the
spectrum component of the true speech can be expressed based on the
following equation (1):

x(t) = y(t) - u(t)   ... (1)

wherein it is assumed that the environmental noise has irregular
characteristics, and that the speech data as the observation vector a(t) is
such that the environmental noise is added to the true speech component.

In the feature extraction section 5, on the other hand, the
environmental noise, in the form of the speech data input from the
noise observation interval extraction section 3, is input to the noise
characteristic calculation section 13. In the noise characteristic calculation
section 13, the characteristics of the environmental noise in the noise
observation interval Tn are determined.

More specifically, herein, assuming that the distribution of the power
spectrum u(t) of the environmental noise in the speech recognition interval
is the same as that of the environmental noise in the noise observation
interval Tn immediately before that speech recognition interval and that the
distribution is a normal distribution, in the noise characteristic calculation
section 13, a mean value (mean vector) of the environmental noise and the
variance (variance matrix) thereof are determined.

A mean vector μ' and a variance matrix Σ' can be determined based
on the following equation (2):

$$\mu'(i) = \frac{1}{M} \sum_{t=1}^{M} y(t)(i)$$

$$\Sigma'(i,j) = \frac{1}{M} \sum_{t=1}^{M} \bigl(y(t)(i) - \mu'(i)\bigr)\bigl(y(t)(j) - \mu'(j)\bigr) \tag{2}$$

where μ'(i) represents the i-th component of the mean vector μ'
(i = 1, 2, ..., D), y(t)(i) represents the i-th component of the feature
vector of the t-th frame, and Σ'(i, j) represents the component of the i-th row
and the j-th column of the variance matrix Σ' (j = 1, 2, ..., D).



Here, in order to reduce the number of calculations, regarding the
environmental noise, it is assumed that the components of the feature vector
y are not in correlation with each other. In this case, as shown in the
following equation, the variance matrix Σ' is 0 except for the diagonal
components.

$$\Sigma'(i,j) = 0, \quad i \neq j \tag{3}$$

In the noise characteristic calculation section 13, in a manner as
described above, the mean vector μ' and the variance matrix Σ', which
define the normal distribution, as the characteristics of the environmental
noise, are determined, and these are supplied to the feature distribution
parameter calculation section 12.
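The noise characteristic calculation of equations (2) and (3) can be sketched as follows. Because the off-diagonal components are assumed to be zero, only the diagonal of the variance matrix is computed and stored as a list. The function name and the list-of-lists input layout are assumptions made for illustration.

```python
def noise_statistics(noise_features):
    """Per-component mean and (diagonal) variance over M noise frames.

    noise_features[t][i] is the i-th component of the feature vector of
    frame t in the noise observation interval. The variance divides by M,
    following equation (2), not by M - 1.
    """
    m = len(noise_features)
    d = len(noise_features[0])
    mean = [sum(y[i] for y in noise_features) / m for i in range(d)]
    var = [sum((y[i] - mean[i]) ** 2 for y in noise_features) / m
           for i in range(d)]
    return mean, var


# Two frames (M = 2) of two-dimensional features (D = 2).
mean, var = noise_statistics([[1.0, 4.0], [3.0, 8.0]])
```

Storing only D variances instead of a D-by-D matrix is the saving in calculation that the uncorrelated-components assumption is meant to provide.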

On the other hand, the output of the power spectrum analysis section
11, that is, the feature vector y of the produced speech containing
environmental noise, is supplied to the feature distribution parameter
calculation section 12. In the feature distribution parameter calculation
section 12, a feature distribution parameter representing the distribution
(distribution of estimated values) of the power spectrum of the true speech
is calculated based on the feature vector y from the power spectrum
analysis section 11 and the characteristics of the environmental noise from
the noise characteristic calculation section 13.
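The calculation performed by the feature distribution parameter calculation section 12, which equations (4) to (7) below make precise, can be previewed numerically for a single component. The sketch treats the noise as normally distributed but restricted to the range from 0 to y, and evaluates the integrals with a simple midpoint rule. The function names, the midpoint rule, and the step count are assumptions for illustration only.

```python
import math


def normal_pdf(u, mean, var):
    """Normal density with the given mean and variance."""
    return math.exp(-(u - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def feature_distribution(y, noise_mean, noise_var, steps=2000):
    """Mean and variance of x = y - u for noise u restricted to [0, y].

    All integrals run from 0 to y and are approximated by the midpoint
    rule (an assumption of this sketch, not part of the description).
    """
    if not y > 0:
        raise ValueError("y must be positive")
    h = y / steps
    us = [(n + 0.5) * h for n in range(steps)]
    p = [normal_pdf(u, noise_mean, noise_var) for u in us]
    z = sum(p) * h                                   # normalizing integral
    e_u = sum(u * w for u, w in zip(us, p)) * h / z  # E[u] on [0, y]
    e_u2 = sum(u * u * w for u, w in zip(us, p)) * h / z
    mean_x = y - e_u                                 # E[x]
    e_x2 = y * y - 2.0 * y * e_u + e_u2              # E[x^2]
    var_x = e_x2 - mean_x ** 2                       # V[x]
    return mean_x, var_x


mean_x, var_x = feature_distribution(y=10.0, noise_mean=5.0, noise_var=1.0)
```

In this symmetric example the estimated speech component is close to 5 with a variance close to 1, since the truncation at 0 and at y removes almost no probability mass.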

More specifically, in the feature distribution parameter calculation
section 12, assuming that the distribution of the power spectrum of the true
speech is a normal distribution, the mean vector ξ thereof and the variance
matrix Ψ thereof are calculated as feature distribution parameters based on
the following equations (4) to (7):

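$$
\begin{aligned}
\xi(t)(i) &= E[x(t)(i)] \\
&= E[y(t)(i) - u(t)(i)] \\
&= \int_0^{y(t)(i)} \bigl(y(t)(i) - u(t)(i)\bigr)\, \frac{P(u(t)(i))}{\int_0^{y(t)(i)} P(u(t)(i))\, du(t)(i)}\, du(t)(i) \\
&= \frac{y(t)(i) \int_0^{y(t)(i)} P(u(t)(i))\, du(t)(i) - \int_0^{y(t)(i)} u(t)(i)\, P(u(t)(i))\, du(t)(i)}{\int_0^{y(t)(i)} P(u(t)(i))\, du(t)(i)} \\
&= y(t)(i) - \frac{\int_0^{y(t)(i)} u(t)(i)\, P(u(t)(i))\, du(t)(i)}{\int_0^{y(t)(i)} P(u(t)(i))\, du(t)(i)}
\end{aligned}
\tag{4}
$$

When i = j,

$$
\begin{aligned}
\Psi(t)(i,j) &= V[x(t)(i)] \\
&= E[(x(t)(i))^2] - (E[x(t)(i)])^2 \\
&\bigl(= E[(x(t)(i))^2] - (\xi(t)(i))^2\bigr)
\end{aligned}
$$

When i ≠ j,

$$\Psi(t)(i,j) = 0 \tag{5}$$

$$
\begin{aligned}
E[(x(t)(i))^2] &= E[(y(t)(i) - u(t)(i))^2] \\
&= \int_0^{y(t)(i)} \bigl(y(t)(i) - u(t)(i)\bigr)^2\, \frac{P(u(t)(i))}{\int_0^{y(t)(i)} P(u(t)(i))\, du(t)(i)}\, du(t)(i) \\
&= \frac{1}{\int_0^{y(t)(i)} P(u(t)(i))\, du(t)(i)} \Bigl\{ (y(t)(i))^2 \int_0^{y(t)(i)} P(u(t)(i))\, du(t)(i) \\
&\qquad - 2\, y(t)(i) \int_0^{y(t)(i)} u(t)(i)\, P(u(t)(i))\, du(t)(i) \\
&\qquad + \int_0^{y(t)(i)} (u(t)(i))^2\, P(u(t)(i))\, du(t)(i) \Bigr\} \\
&= (y(t)(i))^2 - 2\, y(t)(i)\, \frac{\int_0^{y(t)(i)} u(t)(i)\, P(u(t)(i))\, du(t)(i)}{\int_0^{y(t)(i)} P(u(t)(i))\, du(t)(i)} \\
&\qquad + \frac{\int_0^{y(t)(i)} (u(t)(i))^2\, P(u(t)(i))\, du(t)(i)}{\int_0^{y(t)(i)} P(u(t)(i))\, du(t)(i)}
\end{aligned}
\tag{6}
$$

$$P(u(t)(i)) = \frac{1}{\sqrt{2\pi\, \Sigma'(i,i)}} \exp\!\left(-\frac{\bigl(u(t)(i) - \mu'(i)\bigr)^2}{2\, \Sigma'(i,i)}\right) \tag{7}$$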

where ξ(t)(i) represents the i-th component of the mean vector ξ(t) in the
t-th frame, E[] means a mean value within [], x(t)(i) represents the i-th
component of the power spectrum x(t) of the true speech in the t-th frame,
u(t)(i) represents the i-th component of the power spectrum of the
environmental noise in the t-th frame, and P(u(t)(i)) represents the
probability that the i-th component of the power spectrum of the
environmental noise in the t-th frame is u(t)(i). Herein, since a normal
distribution is assumed as the distribution of the environmental noise,
P(u(t)(i)) can be expressed as shown in equation (7).

Furthermore, Ψ(t)(i, j) represents the component of the i-th row and
the j-th column of the variance matrix Ψ(t) in the t-th frame. In addition, V[]
represents the variance within [].

In the feature distribution parameter calculation section 12, in a
manner as described above, for each frame, the mean vector ξ and the
variance matrix Ψ are determined as the feature distribution parameters
representing the distribution (herein, the distribution in a case where the
distribution in the feature vector space of the true speech is assumed to be a
normal distribution) in the feature vector space of the true speech.

Thereafter, the feature distribution parameters determined in each
frame of the speech recognition interval are output to the speech recognition
section 6. That is, if the speech recognition interval is T frames and the
feature distribution parameter determined in each of the T frames is denoted
as z(t) = {ξ(t), Ψ(t)} (t = 1, 2, ..., T), the feature distribution parameter
calculation section 12 supplies the feature distribution parameter (sequence)
Z = {z(1), z(2), ..., z(T)} to the speech recognition section 6.

Referring again to Fig. 1, the speech recognition section 6 classifies
the feature distribution parameter Z input from the feature extraction
section 5 into one of a predetermined number K of sound models and one
no-speech sound model and outputs the classified result as the recognition
result of the input speech. That is, the speech recognition section 6 has
stored therein, for example, an identification function (function for
identifying whether the feature parameter Z is classified into a no-speech
sound model) corresponding to a no-speech segment, and identification
functions (functions for identifying whether the feature parameter Z is
classified into any one of the sound models) corresponding to each of the
predetermined number K of words, and calculates the value of the
identification function of each sound model by using the feature distribution
parameter Z from the feature extraction section 5 as an argument. Then, a
sound model (word or no speech (noise)) for which the function value (what
is commonly called a score) thereof is the maximum is output as a
recognition result.

Fig. 4 shows a detailed example of the construction of the speech
recognition section 6 of Fig. 1. The feature distribution parameter Z input
from the feature distribution parameter calculation section 12 of the feature
extraction section 5 is supplied to identification function computation
sections 21-1 to 21-K, and an identification function computation section
21-s. The identification function computation section 21-k (k = 1, 2, ..., K)
has stored therein an identification function Gk(Z) for identifying a word
corresponding to the k-th sound model of K sound models, and computes
the identification function Gk(Z) by using the feature distribution parameter
Z from the feature extraction section 5 as an argument. The identification
function computation section 21-s has stored therein an identification
function Gs(Z) for identifying a no-speech segment corresponding to the
no-speech sound model, and computes the identification function Gs(Z) by
using the feature distribution parameter Z from the feature extraction
section 5 as an argument.

In the speech recognition section 6, identification (recognition) of a
word or no speech as a class is performed by using, for example, an HMM
(Hidden Markov Model) method.

The HMM method will now be described with reference to Fig. 5. In
Fig. 5, the HMM has H states q1 to qH, and for the state transition, only the
transition to oneself and the transition to the state adjacent to the right are
permitted. Furthermore, the initial state is set to be the leftmost state q1, the
final state is set to be the rightmost state qH, and the state transition from the
final state qH is prohibited. In a manner as described above, a model in
which there is no transition to the state to the left of oneself is called a
left-to-right model, and in the speech recognition, generally, a left-to-right
model is used.
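The left-to-right constraint can be expressed as a simple check on a transition probability table. The function below is a hypothetical helper, not part of the described apparatus; it only verifies that every nonzero entry is either a self transition or a transition to the state immediately to the right, and it makes no assumption about how the final state row is filled.

```python
def is_left_to_right(trans):
    """True if only self and right-neighbor transitions are nonzero.

    trans[i][j] is the probability of moving from state i to state j,
    with states numbered from 0 (leftmost) to H - 1 (rightmost).
    """
    for i, row in enumerate(trans):
        for j, prob in enumerate(row):
            allowed = (j == i) or (j == i + 1)
            if prob != 0.0 and not allowed:
                return False
    return True


three_state = [
    [0.6, 0.4, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0],   # no transition out of the final state
]
ok = is_left_to_right(three_state)
```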

If a model for identifying a k class of the HMM is referred to as a
k-class model, the k-class model is defined by, for example, a probability
(initial state probability) πk(qh) in which the model is initially in a state qh, a
probability (transition probability) ak(qi, qj) in which the model is in a state
qi in a time (frame) t and transitions to a state qj at the next time t+1, and a
probability (output probability) bk(qi)(O) in which the state qi outputs a
feature vector O when the state transition occurs from the state qi (h = 1,
2, ..., H).

In a case where a feature vector sequence O1, O2, ... is given, for
example, the class of a model in which the probability (observation
probability) at which such a feature vector sequence is observed is highest
is assumed to be a recognition result of the feature vector sequence.

Herein, this observation probability is determined by the
identification function Gk(Z). That is, the identification function Gk(Z) is
given based on the following equation (8) by assuming that, in the optimum
state sequence (the manner in which the optimum state transition occurs)
with respect to the feature distribution parameter (sequence) Z = {z1, z2,
..., zT}, the identification function Gk(Z) determines the probability at which
such a feature distribution parameter (sequence) Z = {z1, z2, ..., zT} is
observed:

$$G_k(Z) = \max_{q_1, q_2, \ldots, q_T} \pi_k(q_1) \cdot b'_k(q_1)(z_1) \cdot a_k(q_1, q_2) \cdot b'_k(q_2)(z_2) \cdots a_k(q_{T-1}, q_T) \cdot b'_k(q_T)(z_T) \tag{8}$$

where b'k(qi)(zj) represents the output probability when the output is a
distribution represented by zj. For the output probability bk(s)(Ot), which is
a probability at which each feature vector is output during a state transition,
herein, a normal distribution function is used by assuming that there is no
correlation among the components in the feature vector space. In this case,
when the input is a distribution represented by zt, the output probability
b'k(s)(zt) can be determined based on the following equation (9) by using a
probability density function $P^{m}_{k}(s)(x)$ which is defined by the mean vector
μk(s) and the variance matrix Σk(s), and a probability density function
$P^{f}(t)(x)$ representing the distribution of the feature vector (here, the power
spectrum) x of the t-th frame:

$$
\begin{aligned}
b'_k(s)(z_t) &= \int P^{f}(t)(x)\, P^{m}_{k}(s)(x)\, dx \\
&= \prod_{i=1}^{D} P(s)(i)\bigl(\xi(t)(i), \Psi(t)(i,i)\bigr)
\end{aligned}
\qquad k = 1, 2, \ldots, K;\; s = q_1, q_2, \ldots, q_T;\; t = 1, 2, \ldots, T \tag{9}
$$

where the integration interval of the integration in equation (9) is the
entirety of the D-dimensional feature vector space (here, the power
spectrum space).

Furthermore, in equation (9), P(s)(i)(ξ(t)(i), Ψ(t)(i, i)) is expressed
based on the following equation (10):

$$P(s)(i)\bigl(\xi(t)(i), \Psi(t)(i,i)\bigr) = \frac{1}{\sqrt{2\pi\bigl(\Sigma_k(s)(i,i) + \Psi(t)(i,i)\bigr)}} \exp\!\left(-\frac{\bigl(\mu_k(s)(i) - \xi(t)(i)\bigr)^2}{2\bigl(\Sigma_k(s)(i,i) + \Psi(t)(i,i)\bigr)}\right) \tag{10}$$

where μk(s)(i) represents the i-th component of the mean vector μk(s), and
Σk(s)(i, i) represents the component of the i-th row and the i-th column of
the variance matrix Σk(s). The output probability of the k-class model is
defined by these components.
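The output probability for a distribution-valued input can be sketched as a product over components, each a normal density whose variance is the sum of the model variance and the input variance. This form is the standard result for the integral of a product of two normal densities; the extracted text of equation (10) preserves only the normalizing factor, so the exponent used below should be read as an assumption. The function and parameter names are hypothetical.

```python
import math


def output_probability(model_mean, model_var, input_mean, input_var):
    """Product over components of N(input_mean; model_mean, model_var + input_var).

    model_mean, model_var : per-component mean and variance of a state
    input_mean, input_var : per-component feature distribution parameters
    All four arguments are equal-length sequences (diagonal variances).
    """
    prob = 1.0
    for mu, sigma2, m, v in zip(model_mean, model_var, input_mean, input_var):
        total = sigma2 + v
        prob *= math.exp(-(mu - m) ** 2 / (2.0 * total)) / math.sqrt(2.0 * math.pi * total)
    return prob


point_like = output_probability([0.0], [1.0], [0.0], [0.0])
spread_out = output_probability([0.0], [1.0], [0.0], [1.0])
```

With the input variance set to zero the value reduces to an ordinary normal density evaluated at the input mean, which is the behavior of a continuous HMM that ignores the variance of the feature vector.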

The HMM, as described above, is defined by the initial state
probability πk(qh), the transition probability ak(qi, qj), and the output
probability bk(qi)(O). These probabilities are determined in advance by
calculating a feature vector from the speech data for learning and by using
the feature vector.

Herein, as the HMM, when that shown in Fig. 5 is used, since the
transition always starts from the leftmost state q1, only the initial state
probability corresponding to the state q1 is set to "1", and all the initial state
probabilities corresponding to the other states are set to "0". Furthermore,
as is clear from equations (9) and (10), if Ψ(t)(i, i) is set to "0", the output
probability matches the output probability in a continuous HMM in a case
where the variance of the feature vector is not taken into account.

As a method of learning an HMM, for example, the Baum-Welch
reestimation method, etc., is known.
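The maximization over state sequences in equation (8) is commonly evaluated with the Viterbi recursion. The sketch below works with logarithms to avoid numerical underflow, which is an implementation choice not stated in the text, and it takes the maximum over all end states. The function names and the table layout are assumptions.

```python
import math


def safe_log(p):
    """log(p), with log(0) represented as negative infinity."""
    return math.log(p) if p > 0.0 else float("-inf")


def viterbi_score(log_init, log_trans, log_out):
    """Log of the best-path probability over all state sequences.

    log_init[i]     : log initial state probability of state i
    log_trans[i][j] : log transition probability from state i to state j
    log_out[t][i]   : log output probability of state i for frame t
    """
    n_states = len(log_init)
    score = [log_init[i] + log_out[0][i] for i in range(n_states)]
    for t in range(1, len(log_out)):
        score = [
            max(score[i] + log_trans[i][j] for i in range(n_states)) + log_out[t][j]
            for j in range(n_states)
        ]
    return max(score)


# Two-state left-to-right model, two frames; all numbers are illustrative.
init = [safe_log(p) for p in (1.0, 0.0)]
trans = [[safe_log(p) for p in row] for row in ((0.5, 0.5), (0.0, 1.0))]
out = [[safe_log(p) for p in row] for row in ((0.9, 0.1), (0.2, 0.8))]
score = viterbi_score(init, trans, out)
```

In this example the best path moves from the first state to the second, giving a probability of 1.0 x 0.9 x 0.5 x 0.8.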

Referring again to Fig. 4, the identification function computation
section 21-k (k = 1, 2, ..., K) has stored therein, with respect to the k-class
model, the identification function Gk(Z) of equation (8) which is defined by
the initial state probability πk(qh) which is determined in advance by
learning, the transition probability ak(qi, qj), and the output probability
bk(qi)(O). The identification function computation section 21-k computes
the identification function Gk(Z) by using the feature distribution parameter
Z from the feature extraction section 5 as an argument, and outputs the
function value (the above-described observation probability) Gk(Z) thereof
to a determination section 22. The identification function computation
section 21-s has stored therein an identification function Gs(Z) similar to the
identification function Gk(Z) of equation (8), which is determined by the
initial state probability πs(qh) supplied from the no-speech sound model
correction section 7, the transition probability as(qi, qj), and the output
probability bs(qi)(O). The identification function computation section 21-s
computes the identification function Gs(Z) by using the feature distribution
parameter Z from the feature extraction section 5 as an argument, and
outputs the function value (the above-described observation probability)
Gs(Z) thereof to the determination section 22.

In the determination section 22, with respect to the function value
Gk(Z) (it is assumed herein that it contains the function value Gs(Z)) from
each of the identification function computation sections 21-1 to 21-K, and
the identification function computation section 21-s, for example, by using
the determination rule shown in the following equation (11), the feature
distribution parameter Z, that is, the class (sound model) to which the input
speech belongs, is identified:

$$C(Z) = C_k, \quad \text{if } G_k(Z) = \max_i \{G_i(Z)\} \tag{11}$$

where C(Z) represents the function for performing an identification
operation (process) for identifying a class to which the feature distribution
parameter Z belongs, and furthermore, max on the right side of the second
equation of equation (11) represents the maximum value of the function
value Gi(Z) (here, i = s, 1, 2, ..., K) which follows.

When the determination section 22 determines the class on the basis
of equation (11), the determination section 22 outputs the class as a
recognition result of the input speech.
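The determination rule of equation (11) amounts to selecting the class whose identification function value is largest. A minimal sketch follows; the function name, the dictionary layout, and the class labels are hypothetical.

```python
def determine_class(scores):
    """Return the class label with the largest identification function value.

    scores maps a class label (a word or the no-speech class) to its
    function value; log scores and raw probabilities both work, since
    only the ordering matters.
    """
    return max(scores, key=scores.get)


result = determine_class({"no-speech": -42.0, "word_1": -17.5, "word_2": -30.1})
```

Here result is "word_1", the class with the highest score.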

Referring again to Fig. 1, the no-speech sound model correction
section 7 creates the identification function Gs(Z) corresponding to the
no-speech sound model stored in the speech recognition section 6 on the basis
of the environmental noise as the speech data in the noise observation
interval Tn, which is input from the noise observation interval extraction
section 3, and supplies it to the speech recognition section 6.

Specifically, in the no-speech sound model correction section 7, a
feature vector X is observed with respect to each of M frames of the speech
data (environmental noise) in the noise observation interval Tn, which is
input from the noise observation interval extraction section 3, and the
feature distribution thereof is created.

{F1(X), F2(X), ..., FM(X)}   ... (12)

The feature distribution {Fi(X), i = 1, 2, ..., M} is a probability density
function, and hereinafter is also referred to as a "no-speech feature
distribution PDF".

Next, the no-speech feature distribution PDF is mapped into a
probability distribution Fs(X) corresponding to the no-speech sound model
on the basis of equation (13):

$$F_s(X) = V(F_1(X), F_2(X), \ldots, F_M(X)) \qquad \ldots (13)$$

where V is a correction function (mapping function) for mapping the
no-speech feature distribution PDF {Fi(X), i = 1, 2, ..., M} into the
no-speech sound model Fs(X).



For this mapping, various methods can be conceived depending on the
description of the no-speech feature distribution PDF, for example:

$$F_s(X) = \sum_{i=1}^{M} \beta_i(F_1(X), F_2(X), \ldots, F_M(X), M) \cdot F_i(X) \qquad \ldots (14)$$

$$= \sum_{i=1}^{M} \beta_i \cdot F_i(X) \qquad \ldots (15)$$

where $\beta_i(F_1(X), F_2(X), \ldots, F_M(X), M)$ is a weighting function
corresponding to each no-speech feature distribution and hereinafter is
referred to as "$\beta_i$". The weighting function $\beta_i$ satisfies the
condition of the following equation (16):

$$\sum_{i=1}^{M} \beta_i(F_1(X), F_2(X), \ldots, F_M(X), M) = \sum_{i=1}^{M} \beta_i = 1 \qquad \ldots (16)$$

Here, if it is assumed that the probability distribution Fs(X) of the
no-speech sound model is a normal distribution and that the components which
form the feature vector of each frame are not correlated with each other,
the covariance matrix $\Sigma_i$ of the no-speech feature distribution PDF
{Fi(X), i = 1, 2, ..., M} is a diagonal matrix. However, this assumption
presupposes that the covariance matrix of the no-speech sound model is also
a diagonal matrix. Therefore, if the components which form the feature
vector of each frame are not correlated with each other, the no-speech
feature distribution PDF {Fi(X), i = 1, 2, ..., M} is a normal distribution
$G(E_i, \Sigma_i)$ having a mean and a variance corresponding to each
component. $E_i$ is the mean value of Fi(X) (hereinafter also referred to as
an "expected value" where appropriate), and $\Sigma_i$ is the covariance
matrix of Fi(X).

In addition, if the mean of the no-speech feature distribution
corresponding to the M frames of the noise observation interval Tn is
denoted as $\mu_i$ and the variance thereof is denoted as $\sigma_i^2$, the
probability density function of the no-speech feature distribution can be
expressed by the normal distribution $G(\mu_i, \sigma_i^2)$ (i = 1, 2, ...,
M). Based on the above assumption, by using the mean $\mu_i$ and the
variance $\sigma_i^2$ corresponding to each frame, it is possible to compute
the normal distribution $G(\mu_s, \sigma_s^2)$ (corresponding to the
above-described Gs(Z)) which approximates the no-speech sound model Fs(X)
by the various methods described below.

The first method of computing the normal distribution
$G(\mu_s, \sigma_s^2)$ of the no-speech sound model is a method in which the
no-speech feature distribution $\{G(\mu_i, \sigma_i^2), i = 1, 2, \ldots, M\}$
is used, and as shown in the following equation (17), the mean of all of
$\mu_i$ is the mean value of the no-speech sound model, and as shown in the
following equation (18), the mean of all of $\sigma_i^2$ is the variance of
the no-speech sound model:

$$\mu_s = \frac{a}{M} \sum_{i=1}^{M} \mu_i \qquad \ldots (17)$$

$$\sigma_s^2 = \frac{b}{M} \sum_{i=1}^{M} \sigma_i^2 \qquad \ldots (18)$$

where a and b are coefficients in which the optimum values are determined
by simulation.
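
The first method can be sketched for a single feature component as follows. This is an illustration under stated assumptions rather than the patented implementation: the function name `adapt_by_averaging` is hypothetical, and the coefficients a and b are taken to scale the averaged mean and the averaged variance, respectively.

```python
def adapt_by_averaging(means, variances, a=1.0, b=1.0):
    """First method (sketch): average the per-frame means and variances.

    means[i] and variances[i] are the mean and variance of the i-th
    no-speech feature distribution for one feature component.
    a and b are the simulation-tuned coefficients mentioned in the text.
    """
    m = len(means)
    if m == 0 or len(variances) != m:
        raise ValueError("means and variances must be non-empty and equally long")
    mu_s = a * sum(means) / m        # scaled mean of all per-frame means
    var_s = b * sum(variances) / m   # scaled mean of all per-frame variances
    return mu_s, var_s


if __name__ == "__main__":
    print(adapt_by_averaging([1.0, 2.0, 3.0], [0.2, 0.4, 0.6], a=1.0, b=0.1))
```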

A second method of computing the normal distribution
$G(\mu_s, \sigma_s^2)$ of the no-speech sound model is a method in which
those of the no-speech feature distribution
$\{G(\mu_i, \sigma_i^2), i = 1, 2, \ldots, M\}$ having the expected value
$\mu_i$ are used, and based on the following equations (19) and (20), the
mean value $\mu_s$ of the no-speech sound model, and the variance
$\sigma_s^2$ thereof are computed:

... (20)

where a and b are coefficients in which the optimum values are determined
by simulation.

A third method of computing the normal distribution
$G(\mu_s, \sigma_s^2)$ of the no-speech sound model is a method in which the
mean value $\mu_s$ of the no-speech sound model and the variance
$\sigma_s^2$ thereof are computed by a combination of the no-speech feature
distribution $\{G(\mu_i, \sigma_i^2), i = 1, 2, \ldots, M\}$.

In this method, the probability statistic of each no-speech feature
distribution $G(\mu_i, \sigma_i^2)$ is denoted as $X_i$:

$$\{X_1, X_2, \ldots, X_M\} \qquad \ldots (21)$$

Here, if the probability statistic of the normal distribution
$G(\mu_s, \sigma_s^2)$ of the no-speech sound model is denoted as $X_s$, the
probability statistic $X_s$ can be expressed by a linear combination of the
probability statistics $X_i$ and the weighting function $\beta_i$, as shown
in the following equation (22). The weighting function $\beta_i$ satisfies
the condition of equation (16).

$$X_s = \sum_{i=1}^{M} \beta_i \cdot X_i \qquad \ldots (22)$$

The normal distribution $G(\mu_s, \sigma_s^2)$ of the no-speech sound model
can be expressed as shown in the following equation (23):

$$G(\mu_s, \sigma_s^2) = G\left(\sum_{i=1}^{M} \beta_i \mu_i,\ \sum_{i=1}^{M} \beta_i^2 \sigma_i^2\right) \qquad \ldots (23)$$

In equation (23), the weighting function $\beta_i$ can generally be, for
example, 1/M. In this case, the mean value $\mu_s$ of equation (23) and the
variance $\sigma_s^2$ thereof are determined by using predetermined
coefficients, for example, as shown in the following equations:

$$\mu_s = \frac{a}{M} \sum_{i=1}^{M} \mu_i \qquad \ldots (24)$$

$$\sigma_s^2 = \frac{b}{M^2} \sum_{i=1}^{M} \sigma_i^2 \qquad \ldots (25)$$

where a and b are coefficients in which the optimum values are determined
by simulation.
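
The third method treats the model as a weighted sum of per-frame random variables. The sketch below assumes the per-frame variables are independent, so that the variance of the sum is the sum of the squared weights times the per-frame variances; the function name `combine_independent_normals` is hypothetical, and the coefficients a and b of equations (24) and (25) are left out (equivalently, set to 1).

```python
def combine_independent_normals(means, variances, weights=None):
    """Third method (sketch): linear combination X_s = sum_i w_i * X_i.

    Assumes the X_i are independent normal variables, so that
    mean(X_s) = sum_i w_i * mu_i and var(X_s) = sum_i w_i**2 * sigma_i**2.
    If no weights are given, every frame gets weight 1/M.
    """
    m = len(means)
    if m == 0 or len(variances) != m:
        raise ValueError("means and variances must be non-empty and equally long")
    if weights is None:
        weights = [1.0 / m] * m
    if len(weights) != m or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must have length M and sum to 1 (equation (16))")
    mu_s = sum(w * mu for w, mu in zip(weights, means))
    var_s = sum(w * w * var for w, var in zip(weights, variances))
    return mu_s, var_s


if __name__ == "__main__":
    # With equal weights 1/M the variance is (1/M**2) * sum of variances.
    print(combine_independent_normals([1.0, 3.0], [2.0, 6.0]))
```

Note that with equal weights the variance shrinks as 1/M, which is a narrower distribution than the plain average of the variances used in the first method.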

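In a fourth method of computing the normal distribution
$G(\mu_s, \sigma_s^2)$ of the no-speech sound model, a statistical population
$\Omega_i = \{f_{i,j}\}$ corresponding to the probability statistic $X_i$ of
the no-speech feature distribution
$\{G(\mu_i, \sigma_i^2), i = 1, 2, \ldots, M\}$ is assumed. Herein, if
$\{N_i \equiv N;\ i = 1, 2, \ldots, M\}$ is assumed, the mean value $\mu_i$
can be obtained based on the following equation (26), and the variance
$\sigma_i^2$ can be obtained based on the following equation (28):

$$\mu_i = \frac{1}{N} \sum_{j=1}^{N} f_{i,j} \qquad \ldots (26)$$

$$\sigma_i^2 = \frac{1}{N} \sum_{j=1}^{N} (f_{i,j} - \mu_i)^2 \qquad \ldots (27)$$

$$= \frac{1}{N} \sum_{j=1}^{N} f_{i,j}^2 - \mu_i^2 \qquad \ldots (28)$$

By rearranging equation (28), the relationship of the following
equation (29) holds:

$$\frac{1}{N} \sum_{j=1}^{N} f_{i,j}^2 = \sigma_i^2 + \mu_i^2 \qquad \ldots (29)$$

Herein, if the sum $\Omega$ of the statistical populations,
$\Omega = \bigcup_{i=1}^{M} \Omega_i$, is taken into account, the following
equations (30) and (31) are derived from equation (26), and the following
equations (32) to (34) are derived from equation (29):

$$\mu_s = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} f_{i,j} \qquad \ldots (30)$$

$$= \frac{1}{M} \sum_{i=1}^{M} \mu_i \qquad \ldots (31)$$

$$\sigma_s^2 = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} (f_{i,j} - \mu_s)^2 \qquad \ldots (32)$$

$$= \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} f_{i,j}^2 - \mu_s^2 \qquad \ldots (33)$$

$$= \frac{1}{M} \sum_{i=1}^{M} (\sigma_i^2 + \mu_i^2) - \mu_s^2 \qquad \ldots (34)$$

In practice, equations (31) and (34) are used after being multiplied by
coefficients:

$$\mu_s = \frac{a}{M} \sum_{i=1}^{M} \mu_i \qquad \ldots (35)$$

$$\sigma_s^2 = b \cdot \left(\frac{1}{M} \sum_{i=1}^{M} (\sigma_i^2 + \mu_i^2) - \mu_s^2\right) \qquad \ldots (36)$$

where a and b are coefficients in which the optimum values are determined
by simulation.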
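
Equations (31) and (34) state that the mean and variance of the pooled population can be recovered from the per-frame means and variances alone. The following sketch checks this numerically for equally sized frames; the function names and the sample values are hypothetical, and the coefficients a and b are omitted.

```python
from statistics import fmean, pvariance


def pooled_from_frame_stats(means, variances):
    """Fourth method (sketch): pooled mean and variance per (31) and (34).

    Valid when every frame population has the same size N.
    """
    m = len(means)
    if m == 0 or len(variances) != m:
        raise ValueError("means and variances must be non-empty and equally long")
    mu_s = sum(means) / m
    var_s = sum(var + mu * mu for mu, var in zip(means, variances)) / m - mu_s ** 2
    return mu_s, var_s


def pooled_directly(frames):
    """Reference computation: pool every sample, then take mean and variance."""
    pooled = [f for frame in frames for f in frame]
    return fmean(pooled), pvariance(pooled)


if __name__ == "__main__":
    frames = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]  # M = 2 frames, N = 3 samples each
    stats = ([fmean(f) for f in frames], [pvariance(f) for f in frames])
    print(pooled_from_frame_stats(*stats))
    print(pooled_directly(frames))
```

Unlike the first method, the pooled variance also grows with the spread of the per-frame means, so noise whose level drifts across the observation interval yields a wider no-speech model.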

Furthermore, as shown in the following equation (37), a coefficient
may be multiplied to only the variance $\sigma_s^2$.

Next, the operation of the speech recognition apparatus of Fig. 1 is
described.

Speech data (produced speech containing environmental noise for
the object of recognition) collected by the microphone 1 is input to the
framing section 2, whereby the speech data is formed into frames, and the
speech data of each frame is supplied, as an observation vector "a", to the
noise observation interval extraction section 3 and the feature extraction
section 5 in sequence. In the noise observation interval extraction section 3,
speech data (environmental noise) in the noise observation interval Tn
before timing tb at which the speech production switch 4 is turned on is
extracted, and the speech data is supplied to the feature extraction section 5
and the no-speech sound model correction section 7.
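
The framing and noise-interval extraction steps can be pictured with the following sketch. It is only an illustration: the function names, the use of non-overlapping fixed-length frames, and the representation of timing tb as a frame index are all assumptions that the text does not specify.

```python
def split_into_frames(samples, frame_len):
    """Cut a sample sequence into consecutive, non-overlapping frames.

    Non-overlapping frames are an assumption made for brevity; trailing
    samples that do not fill a whole frame are dropped.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


def noise_observation_frames(frames, tb_index, m):
    """Return up to m frames immediately before frame index tb_index.

    tb_index is a hypothetical frame index for the timing tb at which the
    speech production switch is turned on.
    """
    start = max(0, tb_index - m)
    return frames[start:tb_index]


if __name__ == "__main__":
    frames = split_into_frames(list(range(10)), 2)
    print(noise_observation_frames(frames, tb_index=3, m=2))
```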



In the no-speech sound model correction section 7, based on the
environmental noise as the speech data in the noise observation interval Tn,
updating (adaptation) of the no-speech sound model is performed by one of
the above-described first to fourth methods, and the model is supplied to the
speech recognition section 6. In the speech recognition section 6, an
identification function corresponding to the no-speech sound model, which
is stored up to that time, is updated by the identification function as the
no-speech sound model supplied from the no-speech sound model correction
section 7. That is, an adaptation of the no-speech sound model is
performed.

In the feature extraction section 5, on the other hand, the speech data
as the observation vector "a" from the framing section 2 is subjected to
sound analysis in order to determine the feature vector y thereof.
Furthermore, in the feature extraction section 5, based on the determined
feature vector y, a feature distribution parameter Z representing the
distribution in the feature vector space is calculated and is supplied to the
speech recognition section 6. In the speech recognition section 6, by using
the feature distribution parameter from the feature extraction section 5, the
value of the identification function of the sound model corresponding to no
speech and each of a predetermined number K of words is computed, and a
sound model in which the function value thereof is a maximum is output as
the recognition result of the speech.

As described above, since the speech data as the observation vector
"a" is converted into a feature distribution parameter Z representing the
distribution in the feature vector space which is a space of the features
thereof, the feature distribution parameter is such that the distribution
characteristics of noise contained in the speech data are taken into
consideration. Furthermore, since the identification function corresponding
to the no-speech sound model for identifying (recognizing) no speech is
updated on the basis of the speech data in the noise observation interval Tn
immediately before speech is produced, it is possible to greatly improve the
speech recognition rate.

Fig. 6 shows results of an experiment (simulation) in which the
change of the speech recognition rate was measured when the no-speech
segment Ts (see Fig. 2) from when the speech production switch 4 is turned
on until speech is produced is changed.

In Fig. 6, the curve "a" shows results by a conventional method in
which a no-speech sound model is not corrected (an adaptation of the
no-speech sound model is not performed), the curve "b" shows results by the
first method, the curve "c" shows results by the second method, the curve
"d" shows results by the third method, and the curve "e" shows results by
the fourth method.

The conditions of the experiment are as follows. The speech data
used for recognition was collected within a car traveling on an expressway.
The noise observation interval Tn is approximately 0.2 seconds,
corresponding to 20 frames. The no-speech segment Ts was set to 0.05, 0.1,
0.2, 0.3, and 0.5 seconds. In the extraction of the features of the speech
data, analysis was performed in an MFCC domain (the features were obtained
by MFCC (Mel-Frequency Cepstral Coefficients) analysis). The number of
people producing speech for the object of recognition is eight (four males
and four females), and 303 words were spoken by each person. The
recognition vocabulary consisted of 5000 Japanese words. The sound model is
an HMM, and learning was performed in advance by using speech data prepared
for learning. In the speech recognition, a Viterbi search method was used,
and the beam width thereof was set to 3000.

In the first, second, and fourth methods, the coefficient "a" was set to 
1.0, and the coefficient "b" was set to 0.1. In the third method, the 
coefficient "a" was set to 1.0, and the coefficient "b" was set to 1.0. 

As is clear from Fig. 6, in the conventional method (curve "a"), as
the no-speech segment Ts is increased, the speech recognition rate is
decreased considerably. In the first to fourth methods (curves "b" to "e") of
the present invention, even if the no-speech segment Ts is increased, the
speech recognition rate is decreased only slightly. That is, according to the
present invention, even if the no-speech segment Ts is changed, it is
possible for the speech recognition rate to be maintained at a particular
level.

In each of the above-described first to fourth methods, the mean
value $\mu_s$ which defines the normal distribution $G(\mu_s, \sigma_s^2)$ of
the no-speech sound model becomes a mean value of the mean values $\mu_i$ of
the no-speech feature distribution $G(\mu_i, \sigma_i^2)$. Therefore, for
example, if the mean value of the mean values $\mu_i$ of the no-speech
feature distribution $G(\mu_i, \sigma_i^2)$ is denoted as $\mu$, and the
normal distributions of the no-speech sound models, determined by the first
to fourth methods, are denoted as $G_{s1}(\mu, \sigma_{s1}^2)$,
$G_{s2}(\mu, \sigma_{s2}^2)$, $G_{s3}(\mu, \sigma_{s3}^2)$, and
$G_{s4}(\mu, \sigma_{s4}^2)$, respectively, these become distributions in
the feature space in which the mean value $\mu$ is the center (center of
gravity).

The adaptation of a no-speech sound model by the above-described
first to fourth methods, based on the no-speech feature distribution
$G(\mu_i, \sigma_i^2)$, can be defined by the following equation (38) by
using a mapping V. Hereinafter, where appropriate, $G(\mu_i, \sigma_i^2)$ is
described as $G_i$, and $G(\mu_s, \sigma_s^2)$ is described as $G_s$.

$$G_s(\cdot) = V(G_1, G_2, \ldots, G_i, \ldots) \qquad \ldots (38)$$

Furthermore, herein, the no-speech sound model $G_s$ is assumed to be a
normal distribution, and a normal distribution is defined by a mean value
and a variance. Therefore, if the mean value and the variance which define
the normal distribution $G_s$ are expressed by $\mu_s$ and $\sigma_s^2$ as
described above, the definition of equation (38) can also be expressed by
equations (39) and (40) by using the mappings $V_\mu$ and $V_{\sigma^2}$ of
the mean value and the variance, respectively:

$$\mu_s = V_\mu(G_1, G_2, \ldots) \qquad \ldots (39)$$

$$\sigma_s^2 = V_{\sigma^2}(G_1, G_2, \ldots) \qquad \ldots (40)$$

In the first to fourth methods expressed by the above-described
mappings V ($V_\mu$ and $V_{\sigma^2}$), the no-speech feature distributions
$G_1, G_2, \ldots, G_M$ in a time series, obtained from each of the M frames
in the noise observation interval Tn (Fig. 2), are treated equally.

However, the environmental noise in the speech recognition interval,
strictly speaking, is not the same as the environmental noise in the noise
observation interval Tn immediately before the speech recognition interval,
and furthermore, generally, it is estimated that the more distant from (the
start time tc of) the speech recognition interval, the more the environmental
noise in the noise observation interval Tn differs from the environmental
noise in the speech recognition interval.

Therefore, the no-speech feature distributions $G_1, G_2, \ldots, G_M$ in a
time series, obtained from each of the M frames in the noise observation
interval Tn (see Fig. 2), should not be treated equally. Instead, those
which are nearer to the speech recognition interval should be given a larger
weight, and those which are more distant from the speech recognition
interval should be given a smaller weight. As a result, an adaptation
(correction and updating) of a no-speech sound model which further improves
speech recognition accuracy becomes possible.

Accordingly, regarding the no-speech feature distributions
$G_1, G_2, \ldots, G_M$ obtained in the noise observation interval Tn, the
degree of freshness representing the recentness thereof (here, corresponding
to the recentness to the speech recognition interval) is introduced, and a
method of performing an adaptation of a no-speech sound model by taking this
freshness into account is described below.

Fig. 8 shows an example of the construction of the no-speech sound
model correction section 7 of Fig. 1, which performs an adaptation of a
no-speech sound model.

A freshness function storage section 31 has stored therein 
(parameters which define) a freshness function which is a function 
representing the degree of freshness such as that described above. 

A sequence of observation vectors (here, speech data of M frames)
as speech data (noise) in the noise observation interval Tn, output by the
noise observation interval extraction section 3, is input to a correction
section 32. The correction section 32 obtains the no-speech feature
distributions $G_1, G_2, \ldots, G_M$ from these observation vectors, and
performs an adaptation of a no-speech sound model on the basis of these
distributions and the freshness function stored in the freshness function
storage section 31.

Herein, the no-speech feature distribution $G_1, G_2, \ldots, G_M$ contains
discrete values observed in each of the M frames in the noise observation
interval Tn. If the no-speech sound model correction section 7 is a system
which processes discrete values, the no-speech feature distribution
$G_1, G_2, \ldots, G_M$, which contains discrete values, can be used as it
is. However, in a case where the no-speech sound model correction section 7
is a system which processes continuous values, for example, as shown in
Fig. 9, it is necessary to convert the no-speech feature distribution
$G_1, G_2, \ldots, G_M$, which contains discrete values, into continuous
values by a continuous converter, after which the values are processed by
the no-speech sound model correction section 7. As a method of converting
discrete values into continuous values, for example, there is a method of
performing an approximation by a spline function.

The discrete values are a finite number of observed values, observed
at discrete times in a particular finite observation interval, and the
continuous values are an infinite number of observed values, observed at
arbitrary times, in a particular finite (or infinite) observation interval and
are expressed by a particular function.

In a case where the no-speech feature distribution used for an
adaptation of a no-speech sound model contains discrete values, the
freshness function also becomes a function of discrete values, and in a case
where the no-speech feature distribution contains continuous values, the
freshness function also becomes a function of continuous values.

Next, a freshness function, and an adaptation of a no-speech sound
model using the freshness function, are described separately for the case
where the freshness function contains discrete values and the case where the
freshness function contains continuous values.

First, a freshness function F(x) can be defined as shown in, for
example, equations (41) to (43) below:

$$F(x) = 0 \quad \text{if } x \notin \Omega_{obs} \qquad \ldots (41)$$

$$F(x_2) \ge F(x_1) \quad \text{if } x_2 \ge x_1 \qquad \ldots (42)$$

$$\int_{\Omega_{obs}} F(x)\,dx = 1 \qquad \ldots (43)$$

where $\Omega_{obs}$ represents the observation interval of the no-speech
feature distribution, and in this embodiment, it corresponds to the noise
observation interval Tn.

Based on equation (41), the freshness function F(x) becomes 0
outside the observation interval $\Omega_{obs}$. Furthermore, based on
equation (42), the freshness function F(x) is a function which increases as
time elapses or which does not change (in this specification, referred to as
a "monotonically increasing function") in the observation interval
$\Omega_{obs}$. Therefore, basically, the nearer to the speech recognition
interval (see Fig. 2), the larger the value of the freshness function F(x).
Furthermore, based on equation (43), the freshness function F(x) is a
function in which, when an integration is performed over the observation
interval $\Omega_{obs}$, the integrated value thereof becomes 1. Based on
equations (41) to (43), the freshness function F(x) becomes, for example,
such as that shown in Fig. 10.

Herein, in this embodiment, the freshness function F(x) is used as a
multiplier by which the no-speech feature distribution is multiplied, as
will be described later. Therefore, when the value of the function is
positive or negative, the freshness function F(x) acts as a weight on the
no-speech feature distribution by which it is multiplied. Furthermore, when
the value of the function is 0, the freshness function F(x) acts so as to
invalidate the no-speech feature distribution by which it is multiplied, so
that no influence is exerted on the adaptation of the no-speech sound model.

In the correction section 32 of Fig. 8, by using the freshness function
F(x) such as that described above and the no-speech feature distribution
$G_1, G_2, \ldots, G_M$, basically, the no-speech sound model $G_s$ after
adaptation can be determined based on equation (44):

$$G_s = V(G_1, \ldots, G_M) = \sum_{x=1}^{M} F(x)\, G_x \qquad \ldots (44)$$

According to equation (44), the no-speech feature distribution which
is nearer to the speech recognition interval is given a larger weight when
the adaptation of the no-speech sound model is performed. As a result, it is
possible to improve the speech recognition accuracy even more.
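
One way to turn the weighted sum of equation (44) into the mean and variance of a single normal distribution is sketched below. This is an assumption, not something the text states: the weighted sum of densities is read as a mixture with weights F(x) that sum to 1, and the mixture is collapsed to one normal distribution by matching its first two moments. The function name is hypothetical.

```python
def freshness_weighted_model(means, variances, weights):
    """Collapse sum_x F(x) * G_x into one (mean, variance) pair.

    Assumption: the weighted sum is treated as a mixture of normal
    distributions and approximated by moment matching. weights[x] plays
    the role of F(x) and must sum to 1.
    """
    m = len(means)
    if m == 0 or len(variances) != m or len(weights) != m:
        raise ValueError("means, variances, and weights must be equally long")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("freshness weights must sum to 1")
    mu_s = sum(w * mu for w, mu in zip(weights, means))
    second_moment = sum(w * (var + mu * mu)
                        for w, mu, var in zip(weights, means, variances))
    return mu_s, second_moment - mu_s ** 2


if __name__ == "__main__":
    # The later frame (weight 0.75) pulls the model mean toward its own mean.
    print(freshness_weighted_model([0.0, 2.0], [1.0, 1.0], [0.25, 0.75]))
```

With equal weights 1/M this reduces to the pooled statistics of the fourth method, so the freshness function can be seen as replacing the uniform treatment of frames with a time-dependent one.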

Next, a specific example of the freshness function F(x), and an
adaptation of a no-speech sound model using it are described. In the
following, it is assumed that the observation interval $\Omega_{obs}$ of the
no-speech feature distribution (in this embodiment, the noise observation
interval Tn) is an interval in which x is from 0 to $x_M$. Furthermore, as
the function values of the freshness function F(x), the values of only the
observation interval $\Omega_{obs}$ are considered (as shown in equation
(41), since the function values are 0 outside the observation interval
$\Omega_{obs}$, in the following that point is not mentioned).

As the freshness function F(x), for example, a linear function can be
used. In a case where continuous values are taken as the function values,
the freshness function F(x) is expressed based on, for example, equation
(45):

$$F(x) = \alpha \cdot x \qquad \ldots (45)$$

$\alpha$ in equation (45) is a predetermined constant, and this constant
$\alpha$ becomes $2/x_M^2$ on the basis of the definition of the freshness
function of equation (43). Therefore, the freshness function F(x) of
equation (45) is expressed based on equation (46):

$$F(x) = \frac{2}{x_M^2} \cdot x \qquad \ldots (46)$$

Here, the freshness function F(x) shown in equation (46) is shown in
Fig. 11.

In this case, the no-speech sound model $G_s$ after adaptation is
determined based on equation (47):

$$G_s = \frac{2}{x_M^2} \int_0^{x_M} x \cdot G_x(\mu_x, \sigma_x^2)\,dx \qquad \ldots (47)$$

where $G_x(\mu_x, \sigma_x^2)$ represents a no-speech feature distribution
at time x, and $\mu_x$ and $\sigma_x^2$ are the mean value and the variance
which define the normal distribution representing the no-speech feature
distribution, respectively.

Next, as the freshness function F(x), for example, a linear function
which takes discrete values can be used. In this case, the freshness function
F(x) is expressed based on, for example, equation (48):

$$F(x) = \alpha \cdot x, \quad x = 1, 2, \ldots, x_M \qquad \ldots (48)$$

$\alpha$ in equation (48) is a predetermined constant, and this constant
$\alpha$ becomes $2/(x_M(x_M+1))$ on the basis of the definition of the
freshness function of equation (43), with the integral replaced by a sum
over the sample points. Therefore, the freshness function F(x) of
equation (48) is expressed based on equation (49):

$$F(x) = \frac{2}{x_M(x_M+1)} \cdot x \qquad \ldots (49)$$

Herein, the freshness function F(x) expressed by equation (49) is
shown in Fig. 12.

In this case, a no-speech sound model $G_s$ after adaptation is
determined based on equation (50):

$$G_s = \sum_{x=1}^{x_M} \frac{2x}{x_M(x_M+1)}\, G_x \qquad \ldots (50)$$

where $G_x$ represents the no-speech feature distribution at a sample point
(sample time) x.
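
As a small worked example of the discrete linear freshness function, take a hypothetical observation interval of four sample points, $x_M = 4$ (a value chosen only for illustration). The normalizing constant and the resulting weights are then:

```latex
\begin{align*}
\alpha &= \frac{2}{x_M (x_M + 1)} = \frac{2}{4 \cdot 5} = \frac{1}{10}, \\
F(1) &= 0.1, \quad F(2) = 0.2, \quad F(3) = 0.3, \quad F(4) = 0.4, \\
\sum_{x=1}^{4} F(x) &= 0.1 + 0.2 + 0.3 + 0.4 = 1, \\
G_s &= 0.1\,G_1 + 0.2\,G_2 + 0.3\,G_3 + 0.4\,G_4 .
\end{align*}
```

The most recent frame therefore carries four times the weight of the oldest one, whereas the first to fourth methods would give every frame the weight 0.25.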

Next, as the freshness function F(x), for example, a nonlinear
function, such as an exponential function, a high-order binomial function,
or a logarithmic function, can be used. In a case where as the freshness
function F(x), for example, a second-order function as a high-order function
which takes continuous values is used, the freshness function F(x) is
expressed based on, for example, equation (51):

$$F(x) = \alpha \cdot x^2 \qquad \ldots (51)$$

$\alpha$ in equation (51) is a predetermined constant, and this constant
$\alpha$ becomes $3/x_M^3$ on the basis of the definition of the freshness
function of equation (43). Therefore, the freshness function F(x) of
equation (51) is expressed based on equation (52):

$$F(x) = \frac{3}{x_M^3} \cdot x^2 \qquad \ldots (52)$$

Herein, the freshness function F(x) expressed by equation (52) is
shown in Fig. 13.

In this case, the no-speech sound model $G_s$ after adaptation is
determined based on equation (53):

$$G_s = \frac{3}{x_M^3} \int_0^{x_M} x^2 \cdot G_x(\mu_x, \sigma_x^2)\,dx \qquad \ldots (53)$$

Next, as the freshness function F(x), for example, a second-order
function as a high-order function which takes discrete values can be used.
In this case, the freshness function F(x) is expressed based on, for example,
equation (54):

$$F(x) = \alpha \cdot x^2, \quad x = 1, 2, \ldots, x_M \qquad \ldots (54)$$

$\alpha$ in equation (54) is a predetermined constant, and this constant
$\alpha$ becomes $6/(x_M(x_M+1)(2x_M+1))$ on the basis of the definition of
the freshness function of equation (43). Therefore, the freshness function
F(x) of equation (54) is expressed based on equation (55):

$$F(x) = \frac{6}{x_M(x_M+1)(2x_M+1)} \cdot x^2 \qquad \ldots (55)$$

Herein, the freshness function F(x) expressed by equation (55) is
shown in Fig. 14.

In this case, the no-speech sound model $G_s$ after adaptation is
determined based on equation (56):

$$G_s = \sum_{x=1}^{x_M} \frac{6x^2}{x_M(x_M+1)(2x_M+1)}\, G_x \qquad \ldots (56)$$

Next, in a case where as the freshness function F(x), for example, a
logarithmic function which takes continuous values is used, the freshness
function F(x) is expressed based on, for example, equation (57):

$$F(x) = \alpha \cdot \log(x+1) \qquad \ldots (57)$$

$\alpha$ in equation (57) is a predetermined constant, and this constant
$\alpha$ becomes $1/((x_M+1)\log(x_M+1) - x_M)$ on the basis of the
definition of the freshness function of equation (43). Therefore, the
freshness function F(x) of equation (57) is expressed based on equation (58):

$$F(x) = \frac{1}{(x_M+1)\log(x_M+1) - x_M} \cdot \log(x+1) \qquad \ldots (58)$$

Herein, the freshness function F(x) expressed by equation (58) is
shown in Fig. 15.

In this case, the no-speech sound model $G_s$ after adaptation is
determined based on equation (59):

$$G_s = \frac{1}{(x_M+1)\log(x_M+1) - x_M} \int_0^{x_M} \log(x+1) \cdot G_x(\mu_x, \sigma_x^2)\,dx \qquad \ldots (59)$$
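
The normalizing constant of the continuous logarithmic freshness function can be checked numerically. The sketch below assumes that log denotes the natural logarithm, since the closed-form constant integrates to 1 only in that case; the function names and the value of `x_m` are hypothetical, and the midpoint rule is used only as a simple stand-in for exact integration.

```python
import math


def log_freshness(x, x_m):
    """Continuous logarithmic freshness function on [0, x_m] (natural log assumed)."""
    alpha = 1.0 / ((x_m + 1.0) * math.log(x_m + 1.0) - x_m)
    return alpha * math.log(x + 1.0)


def integrate_midpoint(f, lo, hi, steps=10000):
    """Approximate the integral of f over [lo, hi] with the composite midpoint rule."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))


if __name__ == "__main__":
    x_m = 5.0
    total = integrate_midpoint(lambda x: log_freshness(x, x_m), 0.0, x_m)
    print(total)  # expected to be very close to 1
```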



Next, as the freshness function F(x), for example, a logarithmic
function which takes discrete values can be used. In this case, the freshness
function F(x) is expressed based on, for example, equation (60):

$$F(x) = \alpha \cdot \log(x+1), \quad x = 1, 2, \ldots, x_M \qquad \ldots (60)$$

$\alpha$ in equation (60) is a predetermined constant, and this constant
$\alpha$ is determined on the basis of the definition of the freshness
function of equation (43). Therefore, the freshness function F(x) of
equation (60) is expressed based on equation (61):

$$F(x) = \frac{\log(x+1)}{\sum_{y=1}^{x_M} \log(y+1)} \qquad \ldots (61)$$

Herein, the freshness function F(x) expressed by equation (61) is
shown in Fig. 16.

In this case, the no-speech sound model $G_s$ after adaptation is
determined based on equation (62):

$$G_s = \frac{1}{\sum_{y=1}^{x_M} \log(y+1)} \sum_{x=1}^{x_M} \log(x+1) \cdot G_x \qquad \ldots (62)$$

Next, in a case where as the freshness function F(x), for example, a
general high-order function which takes continuous values is used, the
freshness function F(x) is expressed based on, for example, equation (63):

$$F(x) = \alpha \cdot x^p \qquad \ldots (63)$$

$\alpha$ in equation (63) is a predetermined constant, and the order of the
freshness function F(x) is determined by p.

The constant $\alpha$ can be determined on the basis of the definition of
the freshness function of equation (43). Therefore, the freshness function
F(x) of equation (63) is expressed based on equation (64):

$$F(x) = \frac{p+1}{x_M^{p+1}} \cdot x^p \qquad \ldots (64)$$

In this case, the no-speech sound model $G_s$ after adaptation is
determined based on equation (65):

$$G_s = \frac{p+1}{x_M^{p+1}} \int_0^{x_M} x^p \cdot G_x(\mu_x, \sigma_x^2)\,dx \qquad \ldots (65)$$

In equation (64), for example, when p is 1 or 2, the freshness
function F(x) is a linear function or a second-order function which takes
continuous values, and is expressed as shown in equation (46) or (52).

Furthermore, in equation (64), for example, when p is 3, the
freshness function F(x) is a third-order function which takes continuous
values and is expressed as shown in equation (66):

$$F(x) = \frac{4}{x_M^4} \cdot x^3 \qquad \ldots (66)$$

Furthermore, in equation (64), for example, when p is 4, the
freshness function F(x) is a fourth-order function which takes continuous
values and is expressed as shown in equation (67):

$$F(x) = \frac{5}{x_M^5} \cdot x^4 \qquad \ldots (67)$$
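
The constants appearing in equations (64), (66), and (67) all follow from the normalization condition of equation (43) applied to $F(x) = \alpha x^p$ on the interval from 0 to $x_M$. A short derivation, written out here for reference:

```latex
\begin{align*}
1 = \int_0^{x_M} \alpha\, x^p \, dx
  &= \alpha \left[ \frac{x^{p+1}}{p+1} \right]_0^{x_M}
   = \alpha\, \frac{x_M^{\,p+1}}{p+1}
  \quad\Longrightarrow\quad
  \alpha = \frac{p+1}{x_M^{\,p+1}}, \\[4pt]
p = 1:\ \alpha &= \frac{2}{x_M^{2}}, \qquad
p = 2:\ \alpha = \frac{3}{x_M^{3}}, \qquad
p = 3:\ \alpha = \frac{4}{x_M^{4}}, \qquad
p = 4:\ \alpha = \frac{5}{x_M^{5}} .
\end{align*}
```

A larger p concentrates the weight more strongly on the frames closest to the speech recognition interval.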



Next, in a case where as the freshness function F(x), for example, a
general high-order function which takes discrete values is used, the
freshness function F(x) is expressed based on, for example, equation (68):

$$F(x) = \alpha \cdot x^p, \quad x = 1, 2, \ldots, x_M \qquad \ldots (68)$$

$\alpha$ in equation (68) is a predetermined constant, and the order of the
freshness function F(x) is determined by p.

The constant $\alpha$ can be determined on the basis of the definition of
the freshness function of equation (43). Therefore, the freshness
function F(x) of equation (68) is expressed based on equation (69):

$$F(x) = \frac{x^p}{\sum_{y=1}^{x_M} y^p} \qquad \ldots (69)$$

In this case, the no-speech sound model $G_s$ after adaptation is
determined based on equation (70):

$$G_s = \sum_{x=1}^{x_M} \frac{x^p}{\sum_{y=1}^{x_M} y^p}\, G_x \qquad \ldots (70)$$



In equation (69), for example, when p is 1 or 2, the freshness
function F(x) is a linear function or a second-order function which takes
discrete values, and is expressed as shown in equation (49) or (55).

In addition, in equation (69), for example, when p is 3, the freshness
function F(x) is a third-order function which takes discrete values and is
expressed as shown in equation (71):

$$F(x) = \frac{4x^3}{x_M^2 (x_M+1)^2} \qquad \ldots (71)$$

Furthermore, in equation (69), for example, when p is 4, the
freshness function F(x) is a fourth-order function which takes discrete
values and is expressed as shown in equation (72):

$$F(x) = \frac{30x^4}{x_M(x_M+1)(2x_M+1)(3x_M^2+3x_M-1)} \qquad \ldots (72)$$
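
The closed forms for p = 1 to 4 are the general discrete expression $x^p / \sum_y y^p$ with the power sums written out. The following sketch verifies this with exact rational arithmetic; the function names and the choice of seven sample points are hypothetical.

```python
from fractions import Fraction


def discrete_power_freshness(p, x_m):
    """Weights F(x) = x**p / sum_{y=1..x_m} y**p for x = 1..x_m, as exact fractions."""
    total = sum(Fraction(y) ** p for y in range(1, x_m + 1))
    return [Fraction(x) ** p / total for x in range(1, x_m + 1)]


def closed_form(p, x, x_m):
    """Closed-form weight for p = 1..4, using the standard power-sum formulas."""
    n = x_m
    if p == 1:
        return Fraction(2 * x, n * (n + 1))
    if p == 2:
        return Fraction(6 * x ** 2, n * (n + 1) * (2 * n + 1))
    if p == 3:
        return Fraction(4 * x ** 3, n ** 2 * (n + 1) ** 2)
    if p == 4:
        return Fraction(30 * x ** 4,
                        n * (n + 1) * (2 * n + 1) * (3 * n ** 2 + 3 * n - 1))
    raise ValueError("closed form provided only for p = 1..4")


if __name__ == "__main__":
    for p in (1, 2, 3, 4):
        weights = discrete_power_freshness(p, 7)
        print(p, sum(weights), weights[-1])
```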



The concept of the freshness function F(x) can be applied to the
adaptation of a no-speech sound model, and in addition, to adaptation to the
person speaking in a noisy environment and to the adaptation of a sound
model other than a no-speech sound model. In addition, it is also possible
to apply the concept of the freshness function F(x) to speech detection and
non-stationary noise detection. Furthermore, also in the field of sound
signal processing, image signal processing, and communication, use of the
concept of the freshness function F(x) makes it possible to improve
robustness against environmental noise and to improve system
performance.
In the foregoing, although a speech recognition apparatus to which
the present invention is applied has been described, such a speech
recognition apparatus can be applied to, for example, a car navigation
apparatus capable of accepting speech input, and other various types of
apparatuses.

In this embodiment, a feature distribution parameter in which
distribution characteristics of noise are taken into consideration is
determined. This noise includes, for example, noise from the outside in an
environment in which speech is produced, and in addition, includes, for
example, characteristics of a communication line in a case where speech
which was transmitted via a telephone line or other communication lines is
recognized.






Furthermore, the present invention can also be applied to a case in 
which, in addition to speech recognition, image recognition and other 
pattern recognitions are performed. 

For instance, the teachings of the invention can also be transposed to
pattern recognition systems and methods in such applications as:

• object identification and sorting, e.g. in robotics, computer-aided
assembling, identification of persons or vehicles, etc.;

• document authentication;

• optical handwriting recognition;

• etc.

In addition, although in this embodiment an adaptation of a
no-speech sound model is performed by using a no-speech feature distribution
represented as a distribution in a feature space, the adaptation of a
no-speech sound model can also be performed by using features of noise
represented as a point in a feature space.

Next, the above-described series of processing can be performed by 
hardware and can also be performed by software. In a case where the series 
of processing is performed by software, programs which form the software 
are installed into a general-purpose computer, etc. 

Accordingly, Fig. 17 shows an example of the construction of an 
embodiment of a computer into which the programs which execute the 
above-described series of processing are installed. 

The programs may be recorded in advance in a hard disk 105 or a 
ROM 103 as a recording medium contained in the computer. 

Alternatively, the programs may be temporarily or permanently 
stored (recorded) in a removable recording medium 111, such as a floppy 
disk, a CD-ROM (Compact Disc Read Only Memory), an MO (Magneto-optical) 
disk, a DVD (Digital Versatile Disc), a magnetic disk, or a 
semiconductor memory. Such a removable recording medium 111 may be 
provided as what is commonly called package software. 

In addition to being installed into a computer from the removable 
recording medium 111 such as that described above, programs may be 
transferred in a wireless manner from a download site via an artificial 
satellite for digital satellite broadcasting or may be transferred by wire to a 
computer via a network, such as a LAN (Local Area Network) or the 
Internet, and in the computer, the programs which are transferred in such a 
manner are received by a communication section 108 and are installed into 
the hard disk 105 contained therein. 

The computer has a CPU (Central Processing Unit) 102 contained 
therein. An input/output interface 110 is connected to the CPU 102 via a 
bus 101. When a command is input as a result of the user operating an 
input section 107 formed of a keyboard, a mouse, etc., via the input/output 
interface 110, the CPU 102 executes a program stored in a ROM (Read 
Only Memory) 103 in accordance with the command. Alternatively, the 
CPU 102 loads a program stored in the hard disk 105, a program which is 
transferred from a satellite or a network, which is received by the 
communication section 108, and which is installed into the hard disk 105, 
or a program which is read from the removable recording medium 111 
loaded into a drive 109 and which is installed into the hard disk 105, to a 
RAM (Random Access Memory) 104, and executes the program. As a 
result, the CPU 102 performs the processing performed by the 
constructions in the above-described block diagrams. Then, the CPU 102 
outputs the processing result from a display section 106 formed of an LCD 
(Liquid Crystal Display), a speaker, etc., for example, via the input/output 
interface 110, as required, or transmits the processing result from the 
communication section 108, and furthermore, records the processing result 
in the hard disk 105. 

Herein, in this specification, the processing steps which describe a 
program for causing a computer to perform various types of processing 
need not necessarily be performed in a time series in the sequence 
described in the flowcharts, and may also include processing performed in 
parallel or individually (for example, parallel processing or 
object-oriented processing). 

Furthermore, a program may be such that it is processed by one 
computer or may be such that it is processed in a distributed manner by 
plural computers. In addition, a program may be such that it is transferred 
to a remote computer and is executed thereby. 

According to the model adaptive apparatus and the model adaptive 
method, the recording medium, and the pattern recognition apparatus of the 
present invention, an adaptation of a predetermined model is performed 
based on extracted data in a predetermined interval and the degree of 
freshness representing the recentness of the extracted data. Therefore, by 
performing pattern recognition using the model, it is possible to improve 
recognition performance. 
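
As a rough illustration of adapting a model from extracted data and a degree of freshness, the following minimal sketch weights each frame in the interval by its freshness when estimating a Gaussian mean and variance. The linearly increasing form of the weights, the diagonal Gaussian model, and the names linear_freshness and adapt_gaussian are assumptions made for illustration only; they are not taken from this specification.

```python
def linear_freshness(num_frames):
    """Hypothetical freshness weights for frames ordered oldest to newest.

    Assumption: weights grow linearly toward the newest frame and sum to 1.
    """
    total = num_frames * (num_frames + 1) / 2
    return [(i + 1) / total for i in range(num_frames)]


def adapt_gaussian(frames, weights):
    """Freshness-weighted mean and per-dimension variance of feature frames.

    frames:  list of feature vectors (lists of floats), oldest first.
    weights: one non-negative weight per frame; normalized internally.
    """
    total = sum(weights)
    dims = len(frames[0])
    mean = [
        sum(w * f[d] for w, f in zip(weights, frames)) / total
        for d in range(dims)
    ]
    var = [
        sum(w * (f[d] - mean[d]) ** 2 for w, f in zip(weights, frames)) / total
        for d in range(dims)
    ]
    return mean, var


if __name__ == "__main__":
    noise_frames = [[0.0], [0.0], [3.0]]  # the newest frame differs
    print(adapt_gaussian(noise_frames, linear_freshness(3)))
    print(adapt_gaussian(noise_frames, [1.0, 1.0, 1.0]))
```

With these three frames, the freshness-weighted mean is 1.5, whereas the equally weighted mean is 1.0, so the adapted model follows the most recent data more closely.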

Many different embodiments of the present invention may be 
constructed without departing from the spirit and scope of the present 
invention. It should be understood that the present invention is not limited 
to the specific embodiment described in this specification. To the contrary, 
the present invention is intended to cover various modifications and 
equivalent arrangements within the scope of the invention as hereafter 
claimed. 



